-
-
Notifications
You must be signed in to change notification settings - Fork 606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VA: Fix performRemoteValidation goroutine leak #7727
Conversation
No unit tests need to be changed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppose that the parent gRPC context is cancelled.
Today, the resulting sequence of events is:
- the child
rva.PerformValidation
gRPC detect the cancellation and quit, writing their cancellation errors to the results channel - the
for result := range results
loop reads errors from the results channel until the errors surpass the threshold - it increments the remoteValidationFailures metric, modifies the first error to indicate it occurred during secondary validation, and returns
- the remaining goroutines hang trying to write to the results channel
With this change, the resulting sequence of events becomes racy and non-deterministic, depending on what order the go runtime schedules selects. It might behave like the above (with the exception of leaking the child goroutines), or:
- the for/select loop detects the cancellation and immediately quits, returning a different kind of error describing the cancellation and not incrementing the metric
I don't love this non-determinism.
I think a cleaner solution might be simply to make the results channel a buffered channel with capacity len(va.remoteVAs)
. We know that number is always going to be small -- likely never more than 6, even with the new BRs MPIC requirements -- and the items in it are small because they're all pointers.
This way all the rest of the logic can remain untouched: the child goroutines are guaranteed to be able to write to the results channel and exit, and the loop reading from the results channel can continue to early-return as soon as it gets enough results without having to worry about checking for cancellation itself.
That said, I think creating a new context for the child gRPCs and cancelling it when this function returns is still a good idea, as it will cause them to bail out faster.
0dbd487
to
44129e7
Compare
44129e7
to
faf4324
Compare
PerformValidation goroutines write to a buffered results channel to prevent blocking.